Abstract

Inductive learning is based on inferring a general rule from a finite data set and using 
it to label new data. In transduction one attempts to solve the problem of using a labeled 
training set to label a set of unlabeled points, which are given to the learner prior to 
learning. Although transduction seems at the outset to be an easier task than induction, 
there have not been many provably useful algorithms for transduction. Moreover, the 
precise relation between induction and transduction has not yet been determined. The 
main theoretical developments related to transduction were presented by Vapnik more than 
twenty years ago. One of Vapnik's basic results is a rather tight error bound for transductive 
classification based on an exact computation of the hypergeometric tail. While tight, this 
bound is given implicitly via a computational routine. Our first contribution is a somewhat 
looser but explicit characterization of a slightly extended PAC-Bayesian version of Vapnik's 
transductive bound. This characterization is obtained using concentration inequalities 
for the tail of sums of random variables obtained by sampling without replacement. We 
then derive error bounds for compression schemes such as (transductive) support vector 
machines and for transduction algorithms based on clustering. The main observation used 
for deriving these new error bounds and algorithms is that the unlabeled test points, which 
in the transductive setting are known in advance, can be used in order to construct useful 
data dependent prior distributions over the hypothesis space. 

1. Introduction 

The bulk of work in Statistical Learning Theory has dealt with the inductive approach to 
learning. Here one is given a finite set of labeled training examples, from which a rule is 
inferred. This rule is then used to label new examples. As pointed out by Vapnik (1998), in
many realistic situations one actually faces an easier problem where one is given a training 
set of labeled examples, together with an unlabeled set of points which needs to be labeled. 
In this transductive setting, one is not interested in inferring a general rule, but rather only 
in labeling this unlabeled set as accurately as possible. One solution is of course to infer a 
rule as in the inductive setting, and then use it to label the required points. However, as 
argued by Vapnik (1982, 1998), it makes little sense to solve what appears to be an easier 
problem by 'reducing' it to a more difficult one. While there are currently no formal results
stating that transduction is indeed easier than induction,¹ it is plausible that the relevant
information carried by the test points can be incorporated into an algorithm, potentially 
leading to superior performance. Since in many practical situations we are interested in 
evaluating a function only at some points of interest, a major open problem in statistical 
learning theory is to determine precise relations between induction and transduction. 

In this paper we present several general error bounds for transductive learning.² We
also present a general technique for establishing error bounds for transductive learning 
algorithms based on compression and clustering. Our bounds can be viewed as extensions 
of McAllester's PAC-Bayesian framework (McAllester, 1999, 2003a, 2003b) to transductive 
learning. The main advantage of using the PAC-Bayesian approach in transduction, as 
opposed to induction, is that here prior beliefs on hypotheses can be formed based on the 
unlabeled test data. This flexibility allows for the choice of "compact priors" (with small 
support) and therefore, for tight bounds. We use the established bounds and provide tight 
error bounds for "compression schemes" such as (transductive) SVMs and transductive 
learning algorithms based on clustering. While precise relations between induction and 
transduction remain a major challenge, our new bounds and technique offer some new 
insights into transductive learning. 

The problem of transduction was formulated as long ago as 1982 in Vapnik's classic
book (Vapnik, 1982), where the precise setting was formulated, and some implicit error
bounds were derived.³ In recent years the problem has been receiving an increasing amount
of attention, due to its applicability to many real world situations. A non-exhaustive list
of recent contributions includes (Vapnik, 1982, 1998; Joachims, 1999; Bennett & Demiriz,
1998; Demiriz & Bennett, 2000; Wu, Bennett, Cristianini, & Shawe-Taylor, 1999; Lanckriet,
Cristianini, Ghaoui, Bartlett, & Jordan, 2002), and (Blum & Langford, 2003). Most of this
work, with the exception of Vapnik's bounds (1982, 1998) and the results of Lanckriet et al.
(2002), has dealt with algorithmic issues, rather than with performance bounds. An implicit
performance bound, in the spirit of Vapnik (1982, 1998), has recently been presented by
Blum and Langford (2003). We mention this result again later in Section 2.2.

We present explicit PAC-Bayesian bounds for a transductive setting that considers sampling
without replacement of the training set from a given 'full sample' of unlabeled points.
This setting is proposed by Vapnik and it turns out that error bounds for learning algorithms,
within this setting, imply the same bounds within another setting which may appear
more practical (see Section 2.1 and Theorem 2 for details). Sampling without replacement
of the training set leads to the training points being dependent (see Section 2.1 for details).
Our first goal is to provide uniform bounds on the deviation between the training error
and the test error. To this end, we study two types of bounds that utilize two different
bounding techniques. The first approach is based on an observation made in Hoeffding's
classic paper (Hoeffding, 1963). This approach was recently alluded to by Lugosi (2003). As
pointed out by Hoeffding (1963), the sampling without replacement problem can be reduced
to an equivalent problem involving independent samples, for which standard techniques suffice.
We refer to this approach as 'reduction to independence'. A second approach involves
derivations of large deviation bounds for sampling without replacement such as those developed
by Serfling (1974), Vapnik (1998), Hush and Scovel (2003), and Dembo and Zeitouni
(1998). We refer to such bounds as 'direct'. We consider these two approaches in Sections 3.1
and 3.2, respectively. Using these two approaches we derive general PAC-Bayesian bounds
for transduction. It turns out that the direct bounds lead to tighter and more explicit
learning curves for transduction.

1. There may be various ways to state this in a meaningful manner. Essentially, we would like to know if
for some learning problems a particular transductive algorithm can achieve better performance than any
inductive algorithm, where performance may be characterized by learning rates and/or computational
complexity.

2. In the paper we use the terms 'error bound' and 'risk bound' interchangeably. Strictly speaking, the
term 'generalization bound' is not appropriate in the transductive setting.

3. Vapnik refers to transduction also as "estimating the values of a function at given points of interest".

We then show how to utilize PAC-Bayesian transductive bounds to derive error bounds 
for specific learning schemes. In particular, we show how to choose priors, based on the 
given unlabeled data, and derive bounds for "compression schemes" and learning algorithms 
based on clustering. Compression schemes are algorithms that select the same hypothesis 
using only a subset of the training data. The main example of a compression scheme 
is a (transductive) Support Vector Machine (SVM). The compression level achieved by 
such schemes typically depends on the (geometric) complexity of the learning problem at 
hand, which sometimes does not allow for significant compression. A stronger type of 
compression can often be achieved using clustering. A natural approach in the context
of transduction (and semi-supervised learning) is to apply a clustering algorithm over the 
set of all available (unlabeled) points and then to use the labeled points to determine the 
classifications of points in the resulting clusters. We formulate this scheme in the context 
of transduction and derive for it rather tight error bounds by utilizing an appropriate prior 
in our transductive PAC-Bayesian bounds. For practical applications a similar but tighter 
result is obtained by using the implicit bounds of Vapnik (1998) (or the bound of Blum & 
Langford, 2003). 

2. Problem Setup and Vapnik's Basic Results 

2.1 Problem Setup 

The problem of transduction can be informally described as follows. A learner is given a set
of labeled examples $\{(x_1,y_1),\ldots,(x_m,y_m)\}$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$, and a set of unlabeled points
$\{x_{m+1},\ldots,x_{m+u}\}$. Based on this data, the objective is to label the unlabeled points. In
order to formalize the scenario, we consider the setting proposed by Vapnik (1998, Chap.
8), which for simplicity is described in the context of binary classification.⁴ Let $\mathcal{H}$ be a
set of binary hypotheses consisting of functions from the input space $\mathcal{X}$ to $\mathcal{Y} = \{\pm 1\}$. Let
$\mu(x,y)$ be any distribution over $\mathcal{X} \times \mathcal{Y}$. For each $h \in \mathcal{H}$ and a set $Z = x_1,\ldots,x_{|Z|}$ of samples
define

$$R_h(Z) \triangleq \frac{1}{|Z|}\sum_{i=1}^{|Z|} \int_{y \in \mathcal{Y}} \ell(h(x_i),y)\, d\mu(y|x_i), \qquad (1)$$

where, unless otherwise specified, $\ell(\cdot,\cdot)$ is the 0/1 loss function. Vapnik (1998) considers
the following two transductive "protocols".



4. Note that the bound of Theorem 22 does hold for general bounded loss functions. 
Setting 1:

(i) A full sample $X_{m+u} = \{x_1,\ldots,x_{m+u}\}$ consisting of arbitrary $m+u$ points is given.⁵

(ii) We then choose uniformly at random the training sample $X_m \subseteq X_{m+u}$ and receive
its labels $Y_m$, where the label $y$ of $x$ is chosen according to $\mu(y|x)$; the resulting
training set is $S_m = (X_m, Y_m)$ and the remaining set $X_u$ is the unlabeled (test) sample.

(iii) Using both $S_m$ and $X_u$ we select a classifier $h \in \mathcal{H}$. The quality of $h$ is measured by
$R_h(X_u)$.

Setting 2:

(i) We are given a training set $S_m = (X_m, Y_m)$ selected i.i.d. according to $\mu(x,y)$.

(ii) An independent test set $S_u = (X_u, Y_u)$ of $u$ samples is then selected in the same
manner.

(iii) We are required to choose our best $h \in \mathcal{H}$ based on $S_m$ and $X_u$ so as to minimize

$$R_{m,u}(h) \triangleq \int \frac{1}{u}\sum_{i=m+1}^{m+u} \ell(h(x_i),y_i)\, d\mu(x_1,y_1)\cdots d\mu(x_{m+u},y_{m+u}). \qquad (2)$$

Remark 1 Notice that the choice of the sub-sample $X_m$ in Setting 1 is equivalent to sampling
$m$ points from $X_{m+u}$ uniformly at random without replacement. This leads to the
samples being dependent. Also, Setting 1 concerns an "individual sample" $X_{m+u}$ and there
are no assumptions regarding its underlying source. The only element of chance in this
setting is the random choice of the training set from the full sample.
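
To make the source of randomness in Setting 1 concrete, the following minimal Python sketch fixes a toy full sample, draws the training indices uniformly without replacement, and evaluates the three risks of a fixed hypothesis. The helper names (`split_full_sample`, `risk`), the threshold hypothesis and the data are assumptions made for this illustration only.

```python
import random

def split_full_sample(n_total, m, rng):
    """Split indices 0..n_total-1 into a training part of size m and the rest.

    rng.sample draws without replacement, as in Setting 1.
    """
    train = sorted(rng.sample(range(n_total), m))
    chosen = set(train)
    test = [i for i in range(n_total) if i not in chosen]
    return train, test

def risk(h, target, points, idx):
    """Average 0/1 loss of hypothesis h against the target on the given indices."""
    idx = list(idx)
    return sum(h(points[i]) != target(points[i]) for i in idx) / len(idx)

# Toy full sample with m + u = 20 points (hypothetical data).
points = [i / 20 for i in range(20)]
target = lambda x: 1 if x >= 0.5 else -1   # deterministic target
h = lambda x: 1 if x >= 0.3 else -1        # a fixed hypothesis

m = 8
u = len(points) - m
train, test = split_full_sample(len(points), m, random.Random(0))

full_risk = risk(h, target, points, range(len(points)))   # fixed, not random
train_risk = risk(h, target, points, train)               # random via the split
test_risk = risk(h, target, points, test)                 # random via the split
```

Re-running with a different seed changes `train_risk` and `test_risk` but never `full_risk`, which is the sense in which the split is the only element of chance.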

Setting 2 may appear more applicable in some practical situations than Setting 1. However,
derivation of theoretical results can be easier within Setting 1. The following useful
theorem (Theorem 8.1 in Vapnik, 1998) relates the two transduction settings. For completeness
we present Vapnik's proof in the appendix.

Theorem 2 (Vapnik) If for some learning algorithm choosing an hypothesis $h$ it is proved
within Setting 1 that with probability at least $1-\delta$, the deviation between the risks $R_h(X_m)$
and $R_h(X_u)$ does not depend on the composition of the full sample and does not exceed $\varepsilon$,
then with the same probability, in Setting 2 the deviation between $R_h(X_m)$ and the risk given
by formula (2) does not exceed $\varepsilon$.

Remark 3 The learning algorithm in Theorem 2 is implicitly assumed to be deterministic.
The theorem can be extended straightforwardly to the case where the algorithm is randomized
and chooses an hypothesis $h \in \mathcal{H}$ randomly, based on $S_m \cup X_u$. A particular type of
randomization is the one used by Gibbs algorithms, as discussed in Section 4.1.



5. The original Setting 1, as proposed by Vapnik, discusses a full sample whose points are chosen independently
at random according to some source distribution $\mu(x)$.
Remark 4 The basic quantity of interest in Setting 2 is $R_{m,u}(h)$ defined in (2). Assuming
$h$ is selected based on the sample $S_m \cup X_u$, one is often interested in its expectation over a
random selection of this sample. In inductive learning, one considers the sample $S_m$ and
a single new point $(x,y)$, and the average is taken with respect to $S_m \cup \{x\}$. It should be
noted that the random variable $R_{m,u}(h)$ is 'more concentrated' around its mean than the
random variable $\ell(h(x),y)$ corresponding to a single new point $(x,y)$. Therefore, one may
expect transduction to lead to tighter bounds in Setting 2 as well.

In view of Theorem 2 we restrict ourselves in the sequel to Setting 1. Also, for simplicity
we focus on the case where there exists a deterministic target function $\phi : \mathcal{X} \to \mathcal{Y}$, so that
$y = \phi(x)$ is a fixed target label for $x$; that is, $\mu(\phi(x)|x) = 1$ (there is no requirement that
$\phi \in \mathcal{H}$). Note that it is possible to extend our results to the general case of stochastic
targets $y \sim \mu(y|x)$.

We make use of the following quantities, which are all instances of (1). The quantity
$R_h(X_{m+u})$ is called the full sample risk of the hypothesis $h$, $R_h(X_u)$ is referred to as the
transduction risk (of $h$), and $R_h(X_m)$ is the training error (of $h$). Note that in the case
where our target function $\phi$ is deterministic,

$$R_h(X_m) = \frac{1}{m}\sum_{i=1}^{m} \mathbf{E}_{\mu(y|x_i)}\{\ell(h(x_i),y)\} = \frac{1}{m}\sum_{i=1}^{m} \ell(h(x_i),\phi(x_i)).$$

Thus, $R_h(X_m)$ is the standard training error denoted interchangeably by $R_h(S_m)$. It is
important to observe that while $R_h(X_{m+u})$ is not a random variable, both $R_h(X_m)$ and
$R_h(X_u)$ are random variables, due to the random selection of the samples $X_m$ from $X_{m+u}$.
While our objective in transduction is to achieve small error over the unlabeled sample
(i.e. to minimize $R_h(X_u)$), we find that it is sometimes easier to derive error bounds for
the full sample risk. The following simple lemma translates an error bound on $R_h(X_{m+u})$,
the full sample risk, to an error bound on the transduction risk $R_h(X_u)$.

Lemma 5 For any $h \in \mathcal{H}$ and any $C$,

$$R_h(X_{m+u}) \le R_h(S_m) + C \quad\Leftrightarrow\quad R_h(X_u) \le R_h(S_m) + \frac{m+u}{u}\cdot C. \qquad (3)$$

Proof For any $h$,

$$R_h(X_{m+u}) = \frac{m R_h(X_m) + u R_h(X_u)}{m+u}. \qquad (4)$$

Substituting $R_h(S_m)$ for $R_h(X_m)$ in (4) and then substituting the result for the left-hand
side of (3) we get

$$R_h(X_{m+u}) = \frac{m R_h(S_m) + u R_h(X_u)}{m+u} \le R_h(S_m) + C.$$

The equivalence (3) is now obtained by isolating $R_h(X_u)$ on the left-hand side. ■
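
The conversion in Lemma 5 is a one-line rescaling, which the following small Python sketch illustrates numerically. The function names and the numbers are hypothetical and chosen only for this example.

```python
def transduction_bound(train_risk, c, m, u):
    """Right-hand side of (3): the bound on the transduction risk implied by
    a full sample risk bound of the form train_risk + c."""
    return train_risk + (m + u) / u * c

def full_sample_risk(train_risk, test_risk, m, u):
    """Identity (4): the full sample risk as a weighted average of the two parts."""
    return (m * train_risk + u * test_risk) / (m + u)

m, u = 100, 50
train_risk, c = 0.10, 0.02
bound = transduction_bound(train_risk, c, m, u)   # 0.10 + 3 * 0.02

# When the transduction risk sits exactly at its bound, the full sample risk
# sits exactly at train_risk + c, which is the equivalence stated in (3).
boundary_full = full_sample_risk(train_risk, bound, m, u)
```

Note how the slack `c` is inflated by the factor `(m + u) / u`, which is large when the test set is small relative to the training set.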
Remark 6 In applications of Lemma 5, the term $C = C(m)$ is typically a function of $m$
(and of some of the other problem parameters such as $\delta$). In meaningful bounds $C(m) \to 0$
as $m \to \infty$. Observe that in order for the bound on $R_h(X_u)$ to converge it must be that
$u = \omega(mC(m))$.

Consider a hypothesis class $\mathcal{H}$. In the context of transduction, we are only interested
in labeling the test set $X_u$, which is given in advance. Thus, we may in principle regard
hypotheses in $\mathcal{H}$ which label $X_{m+u}$ identically as belonging to the same equivalence class.
Since for fixed values of $m$ and $u$ the number of equivalence classes is finite (in the case of
binary hypotheses, it is at most $2^{m+u}$) we may, without loss of generality, restrict ourselves
to a finite hypothesis class (see, for example, Vapnik, 1998, Sec. 8.5). Note that this freedom
is not available in the inductive setting, where the test set is not known in advance.
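
The reduction to a finite class can be seen in a small Python sketch: hypotheses are grouped by the labeling they induce on the full sample. The threshold class, the grid of thresholds and the four points below are assumptions made for the illustration; they do not come from the paper.

```python
def labeling(theta, points):
    """Labels assigned to the full sample by the threshold hypothesis x -> sign(x - theta)."""
    return tuple(1 if x >= theta else -1 for x in points)

# Hypothetical full sample of m + u = 4 points on the real line.
points = [0.1, 0.4, 0.5, 0.9]

# A fine grid standing in for the (infinite) class of all thresholds.
thetas = [i / 100 for i in range(-50, 151)]

# Group thresholds by the labeling they induce on the full sample.
classes = {}
for theta in thetas:
    classes.setdefault(labeling(theta, points), []).append(theta)

num_classes = len(classes)            # number of equivalence classes
upper_limit = 2 ** len(points)        # the general 2^(m+u) ceiling
```

Here 201 thresholds collapse into only five equivalence classes, well below the general ceiling of 16; one representative per class is all a transductive learner needs to consider.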

Remark 7 The above procedure is clearly not possible in the case of real-valued loss functions.
However, in this case one can still use the availability of the full data set $X_{m+u}$
in order to construct an empirical $\varepsilon$-cover of $\mathcal{H}$ based on the empirical $\ell_1$ norm, namely
$\ell_1(f,g) = (m+u)^{-1}\sum_{i=1}^{m+u} |f(x_i) - g(x_i)|$ for any $f, g \in \mathcal{H}$. We restrict ourselves here to
the binary case.

2.2 Vapnik's Implicit Bounds 

Fix some hypothesis $h \in \mathcal{H}$ and suppose that $h$ makes $k_h$ errors on the full sample (i.e.
$k_h = (m+u)R_h(X_{m+u})$). Consider a random choice of the training set $S_m$ from the full
sample, and let $B(r, k_h, m, u)$ be the probability that $h$ makes exactly $r$ errors over the
training set $S_m$. This probability is by definition the hypergeometric distribution, given by

$$B(r,k_h,m,u) \triangleq \frac{\binom{k_h}{r}\binom{m+u-k_h}{m-r}}{\binom{m+u}{m}}.$$

Since $m$ and $u$ are fixed, throughout this discussion we abbreviate $B(r,k_h) = B(r,k_h,m,u)$.
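Define

$$C(\varepsilon, k_h) \triangleq \Pr\{R_h(X_u) - R_h(X_m) > \varepsilon\} = \Pr\left\{\frac{k_h - r}{u} - \frac{r}{m} > \varepsilon\right\} = \sum_{r} B(r,k_h),$$

where the summation is over all values of $r$ such that $\max\{k_h - u, 0\} \le r \le \min\{m, k_h\}$ and

$$\frac{k_h - r}{u} - \frac{r}{m} > \varepsilon. \qquad (5)$$

Define

$$\Gamma(\varepsilon) \triangleq \max_{k} C\left(\sqrt{\frac{k}{m+u}}\,\varepsilon,\; k\right). \qquad (6)$$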
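
The hypergeometric quantities of this section are straightforward to evaluate numerically. The sketch below is one possible rendering in Python, using exact integer binomials from the standard library; the function names `hyper_pmf`, `tail` and `gamma_relative` are our own labels for $B$, $C$ and $\Gamma$, and the sample sizes are arbitrary.

```python
from math import comb, sqrt

def hyper_pmf(r, k, m, u):
    """B(r, k, m, u): probability of r training errors when h makes k errors in total."""
    return comb(k, r) * comb(m + u - k, m - r) / comb(m + u, m)

def tail(eps, k, m, u):
    """C(eps, k): probability that test error minus training error exceeds eps."""
    lo, hi = max(k - u, 0), min(m, k)
    return sum(hyper_pmf(r, k, m, u)
               for r in range(lo, hi + 1)
               if (k - r) / u - r / m > eps)

def gamma_relative(eps, m, u):
    """Gamma(eps): worst case over k of C(sqrt(k / (m + u)) * eps, k)."""
    return max(tail(sqrt(k / (m + u)) * eps, k, m, u) for k in range(m + u + 1))

m, u, k = 10, 15, 6
# The admissible values of r carry the whole probability mass.
total_mass = sum(hyper_pmf(r, k, m, u)
                 for r in range(max(k - u, 0), min(m, k) + 1))
```

Since `tail` is a finite sum, it is a non-increasing step function of `eps`; this is what makes the search for a minimal admissible value of $\varepsilon$ in the bounds that follow well defined.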
We now state Vapnik's implicit bound for transduction. The bound is slightly adapted
to incorporate a prior probability over $\mathcal{H}$ (the original bound deals with a uniform prior).
Also, the original bound is two-sided and the following theorem is one-sided.⁶

Theorem 8 (Vapnik 1982) Let $\delta$ be given, let $p$ be a prior distribution over $\mathcal{H}$, and let
$\varepsilon^*(h)$ be the minimal value of $\varepsilon$ that satisfies $\Gamma(\varepsilon) \le p(h)\delta$. Then, with probability at least
$1-\delta$, for all $h \in \mathcal{H}$,

$$\frac{R_h(X_u) - R_h(X_m)}{\sqrt{R_h(X_{m+u})}} \le \varepsilon^*(h). \qquad (7)$$

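Proof Using the union bound we have,

$$\begin{aligned}
\Omega &= \Pr\left\{\exists h \in \mathcal{H} : \frac{R_h(X_u) - R_h(X_m)}{\sqrt{R_h(X_{m+u})}} > \varepsilon^*(h)\right\} \\
&= \Pr\left\{\exists h \in \mathcal{H} : R_h(X_u) - R_h(X_m) > \sqrt{R_h(X_{m+u})}\,\varepsilon^*(h)\right\} \\
&\le \sum_{h \in \mathcal{H}} \Pr\left\{R_h(X_u) - R_h(X_m) > \sqrt{\frac{k_h}{m+u}}\,\varepsilon^*(h)\right\} \\
&\le \sum_{h \in \mathcal{H}} \Gamma(\varepsilon^*(h)) \\
&\le \sum_{h \in \mathcal{H}} p(h)\delta = \delta. \qquad (8)
\end{aligned}$$

Note: The convention $\frac{R_h(X_u) - R_h(X_m)}{\sqrt{R_h(X_{m+u})}} = 0$ is used whenever $R_h(X_u) = R_h(X_m) = R_h(X_{m+u}) = 0$.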



It is not hard to convert the bound (7) to the "standard" form (i.e. expressed as empirical
error plus some complexity term). Squaring both sides of (7) and then substituting
$\frac{m}{m+u}R_h(X_m) + \frac{u}{m+u}R_h(X_u)$ for $R_h(X_{m+u})$ we get a quadratic inequality where the "unknown"
variable is $R_h(X_u)$. Solving for $R_h(X_u)$ yields the following result (as in Vapnik,
1998, Equation (8.15)).

Corollary 9 Under the conditions of Theorem 8,

$$R_h(X_u) \le R_h(X_m) + \frac{\varepsilon^*(h)^2\, u}{2(m+u)} + \varepsilon^*(h)\sqrt{R_h(X_m) + \left(\frac{\varepsilon^*(h)\, u}{2(m+u)}\right)^2}.$$
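
As a sanity check on the algebra, the Python sketch below evaluates the explicit "empirical error plus complexity term" form obtained by solving the quadratic inequality, and verifies that plugging the result back into the relative deviation of (7) recovers $\varepsilon^*(h)$. The function names are ours, and the value used for `eps_star` is simply assumed rather than computed from the hypergeometric tail.

```python
from math import sqrt

def corollary9_bound(train_risk, eps_star, m, u):
    """Largest transduction risk compatible with (7), obtained by solving the quadratic."""
    a = eps_star * u / (2 * (m + u))
    return train_risk + eps_star * a + eps_star * sqrt(train_risk + a * a)

def relative_deviation(train_risk, test_risk, m, u):
    """Left-hand side of (7), with the convention 0/0 = 0."""
    full = (m * train_risk + u * test_risk) / (m + u)
    return 0.0 if full == 0 else (test_risk - train_risk) / sqrt(full)

m, u = 100, 50
train_risk, eps_star = 0.10, 0.30     # eps_star is an assumed value for illustration
bound = corollary9_bound(train_risk, eps_star, m, u)

# At the bound, inequality (7) holds with equality.
recovered = relative_deviation(train_risk, bound, m, u)
```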

6. Specifically, in the original bound (Vapnik, 1982) the two-sided condition
$\left|\frac{k_h - r}{u} - \frac{r}{m}\right| > \varepsilon$ is used instead of condition (5).

Remark 10 Theorem 8 focuses on the relative deviation of the divergence $R_h(X_u) - R_h(X_m)$,
and generates a bound that is particularly tight in cases where the empirical error $R_h(X_m)$
is very small. A very similar but simpler version of Vapnik's bound concerns the "absolute"
deviation and is tighter in other cases of interest. The absolute bound is obtained as follows.
Define, instead of (6),

$$\Gamma(\varepsilon) \triangleq \max_{k} C(\varepsilon, k),$$

and let $\varepsilon^*(h)$ be the minimal value of $\varepsilon$ that satisfies $\Gamma(\varepsilon) \le p(h)\delta$, for any given $\delta$. Then,
the absolute bound states that with confidence at least $1-\delta$,

$$R_h(X_u) \le R_h(X_m) + \varepsilon^*(h). \qquad (9)$$

This result is presented by Bottou, Cortes, and Vapnik (1994).

Remark 11 The bound of Corollary 9, and the bound (9), are rather tight. Possible sources
of slackness are only introduced through the utilization of the union bound in (8) and the
definition of $\Gamma$ in (6). However, note that $\varepsilon^*(h)$ is a complicated implicit function of $m$, $u$,
$p(h)$ and $\delta$, leading to a bound that is difficult to interpret and (as noted also by Vapnik)
must be tabulated by a computer in order to be used.
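
To give a feel for what "tabulated by a computer" involves, here is a minimal Python sketch that computes $\varepsilon^*(h)$ for the absolute bound (9) by brute force. It relies on the observation that the tail probability is a step function of $\varepsilon$ that only changes at attainable deviation values, so the minimal admissible $\varepsilon$ is one of those values. The function names, the exhaustive search and the use of exact fractions for the candidate values are choices made for this sketch, not part of the original bound.

```python
from fractions import Fraction
from math import comb

def hyper_pmf(r, k, m, u):
    """Hypergeometric probability of r training errors out of k total errors."""
    return comb(k, r) * comb(m + u - k, m - r) / comb(m + u, m)

def deviations(k, m, u):
    """Pairs (test error minus training error, probability) for a hypothesis with k errors."""
    return [(Fraction(k - r, u) - Fraction(r, m), hyper_pmf(r, k, m, u))
            for r in range(max(k - u, 0), min(m, k) + 1)]

def gamma_abs(eps, m, u):
    """Absolute version of Gamma: worst case over k of Pr{deviation > eps}."""
    return max(sum(p for d, p in deviations(k, m, u) if d > eps)
               for k in range(m + u + 1))

def eps_star(prior_mass, delta, m, u):
    """Smallest eps with gamma_abs(eps) <= prior_mass * delta."""
    candidates = sorted({d for k in range(m + u + 1) for d, _ in deviations(k, m, u)})
    for eps in candidates:
        if gamma_abs(eps, m, u) <= prior_mass * delta:
            return eps
    return candidates[-1]   # not reached: the tail vanishes at the largest deviation

m, u = 10, 10
e = eps_star(prior_mass=0.25, delta=0.1, m=m, u=u)
```

The search is quadratic in the sample size per candidate, which is fine for a sketch but suggests caching or a binary search over `candidates` for realistic values of `m` and `u`.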

Note that a related result has recently been presented by Blum and Langford (2003).
Specifically, Theorem 6 in that paper states a similar implicit bound, based on direct calculation
of the hypergeometric tail. However, Vapnik's bound was originally proved for
a uniform prior over the hypothesis class, and the extension to general priors was first
proposed by Blum and Langford (2003).⁷

3. Concentration Inequalities for Sampling without Replacement 

In this section we present several concentration inequalities that will be used in Section 4 to 
develop PAC-Bayesian bounds for transduction. As discussed in Section 2 (see Remark 1), 
sampling without replacement leads to dependent data, precluding direct application of 
standard large deviation bounds devised for independent samples. Here we present several 
concentration inequalities for sampling without replacement. 

3.1 Inequalities Based on Reduction to Independence 

Even though sampling without replacement leads to dependent samples, Hoeffding (1963) 
pointed out a simple procedure to transform the problem into one involving independent 
data. While this procedure leads to non-trivial bounds it involves some loss in tightness 
(see Section 3.2). 

Lemma 12 (Hoeffding 1963) Let $C = \{c_1,\ldots,c_N\}$ be a finite set with $N$ elements, let
$\{X_1,\ldots,X_m\}$ be chosen uniformly at random with replacement from $C$, and let $\{Z_1,\ldots,Z_m\}$
be chosen uniformly at random without replacement from $C$.⁸ Then, for any continuous
and convex real-valued function $f(x)$, $\mathbf{E}f\left(\sum_{i=1}^m Z_i\right) \le \mathbf{E}f\left(\sum_{i=1}^m X_i\right)$.
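
For a very small population, Hoeffding's comparison can be checked exactly by enumerating all ordered samples. The following Python sketch does this for a toy set and an exponential (hence convex) test function; the population, the sample size and the helper names are assumptions made for the example.

```python
from itertools import permutations, product
from math import exp

def mean_f_with_replacement(population, m, f):
    """Exact expectation of f(X_1 + ... + X_m) over m uniform draws with replacement."""
    draws = list(product(population, repeat=m))
    return sum(f(sum(d)) for d in draws) / len(draws)

def mean_f_without_replacement(population, m, f):
    """Exact expectation of f(Z_1 + ... + Z_m) over m uniform draws without replacement."""
    draws = list(permutations(population, m))
    return sum(f(sum(d)) for d in draws) / len(draws)

population = [0, 0, 1, 1, 1]          # toy finite set with N = 5
m = 3
convex_f = lambda s: exp(0.7 * s)     # continuous and convex

with_repl = mean_f_with_replacement(population, m, convex_f)
without_repl = mean_f_without_replacement(population, m, convex_f)

# For a linear f the two expectations coincide: both sampling schemes have the same mean.
linear_with = mean_f_with_replacement(population, m, lambda s: s)
linear_without = mean_f_without_replacement(population, m, lambda s: s)
```

The gap between `without_repl` and `with_repl` is the room that the direct inequalities of Section 3.2 try to exploit.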



7. Also, the bound of Blum and Langford (2003) may be tighter than Vapnik's bounds in some cases
of interest. A careful numerical comparison should be conducted to determine the relative advantage of
each of these bounds.

8. Note that the variables $\{X_1,\ldots,X_m\}$ are independent, while $\{Z_1,\ldots,Z_m\}$ are dependent.
Lemma 12 can be used in order to generate standard exponential bounds, as proposed by
Hoeffding (1963). We first introduce some notation which will be used in the sequel. Let $\nu$
and $\mu$ be two real numbers in $[0,1]$. We use the following definitions for the binary entropy
and binary KL-divergence, respectively.

$$H(\nu) \triangleq -\nu\log\nu - (1-\nu)\log(1-\nu),$$

$$D(\nu\|\mu) \triangleq \nu\log\frac{\nu}{\mu} + (1-\nu)\log\frac{1-\nu}{1-\mu}.$$

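Theorem 13 (Hoeffding, 1963) Let $C = \{c_1,\ldots,c_N\}$ be a finite set of non-negative
bounded real numbers, $c_i \le B$, and set $\bar{c} = (1/N)\sum_{i=1}^{N} c_i$. Let $Z_1,\ldots,Z_m$ be random
variables obtaining their values by sampling $C$ uniformly at random without replacement.
Set $Z = (1/m)\sum_{i=1}^{m} Z_i$. Then, for any $\varepsilon \le 1 - \bar{c}/B$,

$$\Pr\{Z - \mathbf{E}Z > \varepsilon\} \le \exp\left\{-mD\left(\frac{\bar{c}}{B} + \varepsilon \,\Big\|\, \frac{\bar{c}}{B}\right)\right\} \qquad (10)$$

$$\le \exp\left\{-\frac{2m\varepsilon^2}{B^2}\right\}. \qquad (11)$$

Similar bounds hold for $\Pr\{\mathbf{E}Z - Z > \varepsilon\}$.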

The key to the proof of Theorem 13 is the application of Lemma 12 with $f(\sum_i Z_i) = \exp\{\sum_i (Z_i - \mathbf{E}Z_i)\}$
and the utilization of the Chernoff-Hoeffding bounding technique (see Hoeffding,
1963, for the details).
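
The two forms of Hoeffding's bound are easy to compare numerically. The Python sketch below implements the binary KL-divergence and evaluates both bounds for the special case $B = 1$ (an assumption made to keep the example short), using natural logarithms; the helper names are ours.

```python
from math import exp, log

def binary_kl(nu, mu):
    """Binary KL divergence D(nu || mu) in nats, with 0 log 0 = 0; requires 0 < mu < 1."""
    def term(a, b):
        return 0.0 if a == 0 else a * log(a / b)
    return term(nu, mu) + term(1 - nu, 1 - mu)

def hoeffding_kl_bound(c_bar, eps, m):
    """The KL form of the bound, for B = 1: exp(-m * D(c_bar + eps || c_bar))."""
    return exp(-m * binary_kl(c_bar + eps, c_bar))

def hoeffding_quadratic_bound(eps, m):
    """The quadratic form of the bound, for B = 1: exp(-2 * m * eps^2)."""
    return exp(-2 * m * eps ** 2)

c_bar, eps, m = 0.1, 0.1, 100
kl_version = hoeffding_kl_bound(c_bar, eps, m)
quadratic_version = hoeffding_quadratic_bound(eps, m)
```

With a small mean such as `c_bar = 0.1` the KL form is markedly smaller than the quadratic form, which is why bounds stated through $D(\cdot\|\cdot)$ are preferred when the empirical error is low.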

3.2 Sampling Without Replacement - Direct Inequalities 

In this section we consider approaches which directly establish exponential bounds for sampling
without replacement. As opposed to Vapnik's results (1982, 1998), which provide
tight but implicit bounds, we aim at bounds which depend explicitly on all parameters of
interest. Note that the bound of Theorem 13 does not depend on all the parameters (in
particular, the population size $N$ does not appear and clearly, small population size should
affect the convergence rate). One may expect that bounds developed directly for sampling
without replacement should be tighter than those based on reduction to independence. The
reason for this is as follows. Assume we have sampled $k$ out of $N$ points without replacement.
The next point is to be sampled from a set of $N - k$ rather than $N$ points, which
would be the case in sampling with replacement (where the samples are independent). The
successive reduction in the size of the sampled set reduces the 'randomness' of the newly
sampled point as compared to the independent case. This intuition is at the heart of Serfling's
improved bound (Serfling, 1974), which is stated next. This result holds for general
bounded loss functions and is established by a careful utilization of martingale techniques
combined with Chernoff's bounding method.

Theorem 14 (Serfling, 1974) Let $C = \{c_1,\ldots,c_N\}$ be a finite set of bounded real numbers,
$|c_i| \le B$. Let $Z_1,\ldots,Z_m$ be random variables obtaining their values by sampling $C$
uniformly at random without replacement. Set $Z = (1/m)\sum_{i=1}^{m} Z_i$. Then,

$$\Pr\{Z - \mathbf{E}Z > \varepsilon\} \le \exp\left\{-\left(\frac{2m\varepsilon^2}{B^2}\right)\left(\frac{N}{N-m+1}\right)\right\}, \qquad (12)$$

and similarly for $\Pr\{\mathbf{E}Z - Z > \varepsilon\}$.

Compared to the bound of Theorem 13, the bound in (12) is always tighter than Hoeffding's
second bound (11) when $N/(N-m+1) > 1$ (i.e. when $m > 1$). When applied to our
transduction setup (see Section 4.1) we take $N = m + u$ and the advantage is maximized
when $(m+u)/(u+1)$ is maximized. Thus, considering only the convergence rate obtained
by sampling without replacement, one may expect that the fastest rates should be obtained
when $u$ assumes the smallest possible value (e.g. $u = 1$) and not surprisingly, the advantage
over the bound of Theorem 13 vanishes as $u \to \infty$.
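
The effect of the population-size factor can be tabulated with a few lines of Python. The sketch below compares the Serfling-type exponent, which is multiplied by $N/(N-m+1)$, with the plain quadratic exponent, taking $N = m + u$ as in the transduction setup. The function names and the default $B = 1$ are assumptions of the example.

```python
from math import exp

def hoeffding_bound(eps, m, bound_b=1.0):
    """exp(-2 m eps^2 / B^2): independent of the population size."""
    return exp(-2 * m * eps ** 2 / bound_b ** 2)

def serfling_bound(eps, m, n_total, bound_b=1.0):
    """Same exponent multiplied by N / (N - m + 1), with N = n_total."""
    return exp(-2 * m * eps ** 2 / bound_b ** 2 * n_total / (n_total - m + 1))

m, eps = 100, 0.1
small_u = serfling_bound(eps, m, n_total=m + 1)         # u = 1
large_u = serfling_bound(eps, m, n_total=m + 10_000)    # u = 10000
baseline = hoeffding_bound(eps, m)
```

With `u = 1` the exponent is multiplied by roughly `m / 2`, while with `u = 10000` the two bounds are nearly indistinguishable, matching the discussion above.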

In the case where the $c_i$ are binary variables, the bound in Theorem 14 can be improved
by using a proof based on a counting argument. The following theorem and proof are based
on a simple consequence of Lemma 2.1.33 by Dembo and Zeitouni (1998).

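Theorem 15 Let $C = \{c_1,\ldots,c_N\}$, $c_i \in \{0,1\}$, be a finite set of binary numbers, and set
$\bar{c} = (1/N)\sum_{i=1}^{N} c_i$. Let $Z_1,\ldots,Z_m$ be random variables obtaining their values by sampling
$C$ uniformly at random without replacement. Set $Z = (1/m)\sum_{i=1}^{m} Z_i$ and $\beta = m/N$. Then,
if⁹ $\varepsilon \le \min\{1 - \bar{c},\ \bar{c}(1-\beta)/\beta\}$,

$$\Pr\{Z - \mathbf{E}Z > \varepsilon\} \le \exp\left\{-mD(\bar{c} + \varepsilon \,\|\, \bar{c}) - (N-m)\,D\left(\bar{c} - \frac{\beta\varepsilon}{1-\beta} \,\Big\|\, \bar{c}\right) + 7\log(N+1)\right\}.$$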

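Proof Denote by $N_0$ and $N_1$ the number of appearances in $C$ of $c_i = 0$ and $c_i = 1$,
respectively. Let $m_0$ and $m_1$ be integers, $0 \le m_0, m_1 \le m$, such that $m_0 + m_1 = m$.
The probability of observing $m_1$ appearances of '1' (and thus $m_0$ appearances of '0') in
a random sub-sample selected without replacement is the number of $m$-tuples resulting in
$m_0$ ($m_1$) appearances of 0 (1) in the subsample, divided by the overall number of $m$-tuples,

$$\Pr\left\{\sum_{i=1}^{m} Z_i = m_1\right\} = \frac{\binom{N_1}{m_1}\binom{N_0}{m_0}}{\binom{N}{m}}.$$

Setting $\mu = \bar{c}$, the probability that $Z = (1/m)\sum_{i=1}^{m} Z_i$ is greater than $\nu = \mu + \varepsilon = m_1/m$
(for some natural number $m_1$) is then given by

$$\Pr\left\{\frac{1}{m}\sum_{i=1}^{m} Z_i > \nu\right\} = \sum_{m_1 = \lceil m\nu \rceil}^{m} \frac{\binom{N_1}{m_1}\binom{N_0}{m_0}}{\binom{N}{m}}.$$

Using the Stirling bound we have that

$$\max_{1 \le m \le N} \left|\log\binom{N}{m} - N H\left(\frac{m}{N}\right)\right| \le 2\log(N+1).$$

We thus find that

$$\begin{aligned}
\Pr\left\{\frac{1}{m}\sum_{i=1}^{m} Z_i > \nu\right\}
&\le \sum_{m_1 = m\nu}^{m} \exp\left\{N_1 H\left(\frac{m_1}{N_1}\right) + N_0 H\left(\frac{m_0}{N_0}\right) - N H\left(\frac{m}{N}\right) + 6\log(N+1)\right\} \\
&= \sum_{m_1 = m\nu}^{m} \exp\left\{-mD(\nu\|\mu) - (N-m)\,D\left(\frac{\mu - \beta\nu}{1-\beta} \,\Big\|\, \mu\right) + 6\log(N+1)\right\}.
\end{aligned}$$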

9. The second condition, $\varepsilon \le \bar{c}(1-\beta)/\beta$, simply guarantees that the number of 'ones' in the sub-sample
does not exceed their number in the original sample.
The claim is concluded by upper bounding the sum by the product of the number of terms in
the sum and the maximal term. It is easy to verify that the maximal summand is attained
at $\nu = \mu$. Assuming that $\nu > \mu$, and using the convexity of $D(\nu\|\mu)$ with respect to $\nu$,
we conclude that the largest contribution to the sum is attained at $m_1 = m\nu$, yielding the
bound

$$\begin{aligned}
\Pr\left\{\frac{1}{m}\sum_{i=1}^{m} Z_i > \nu\right\}
&\le m(1-\nu)\exp\left\{-mD(\nu\|\mu) - (N-m)\,D\left(\frac{\mu - \beta\nu}{1-\beta} \,\Big\|\, \mu\right) + 6\log(N+1)\right\} \\
&\le \exp\left\{-mD(\nu\|\mu) - (N-m)\,D\left(\frac{\mu - \beta\nu}{1-\beta} \,\Big\|\, \mu\right) + 7\log(N+1)\right\},
\end{aligned}$$

which establishes the claim upon setting $\nu = \bar{c} + \varepsilon$. ■

Note that the proof of the last bound does not rely on the Chernoff-Hoeffding bounding 
technique as used by Hoeffding (1963) and in many other derivations, but rather on a direct 
counting argument. 
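
Because the binary case is governed by the hypergeometric distribution, the explicit bound of Theorem 15 can be compared against the exact tail. The Python sketch below does so for one configuration. It assumes natural logarithms throughout, parameterizes the event by an integer count `t` (so that $\varepsilon = t/m - \bar{c}$) to avoid floating-point ties, and uses function names of our own choosing.

```python
from math import comb, exp, log

def binary_kl(nu, mu):
    """Binary KL divergence D(nu || mu) in nats, with 0 log 0 = 0; requires 0 < mu < 1."""
    def term(a, b):
        return 0.0 if a == 0 else a * log(a / b)
    return term(nu, mu) + term(1 - nu, 1 - mu)

def direct_bound(n_ones, t, m, n_total):
    """Theorem 15 bound on Pr{sum of Z_i > t}, i.e. with eps = t/m - c_bar."""
    c_bar = n_ones / n_total
    beta = m / n_total
    eps = t / m - c_bar
    rest = c_bar - beta * eps / (1 - beta)   # fraction of ones left outside the sub-sample
    exponent = (-m * binary_kl(c_bar + eps, c_bar)
                - (n_total - m) * binary_kl(rest, c_bar)
                + 7 * log(n_total + 1))
    return exp(exponent)

def exact_tail(n_ones, t, m, n_total):
    """Exact hypergeometric probability that more than t ones fall in the sub-sample."""
    hits = sum(comb(n_ones, m1) * comb(n_total - n_ones, m - m1)
               for m1 in range(t + 1, m + 1))
    return hits / comb(n_total, m)

n_total, m, n_ones, t = 2000, 1000, 1000, 800   # c_bar = 0.5, eps = 0.3
bound = direct_bound(n_ones, t, m, n_total)
exact = exact_tail(n_ones, t, m, n_total)
```

The polynomial factor coming from `7 * log(n_total + 1)` makes the bound vacuous for small deviations, so the comparison is only informative when the exponential terms dominate, as they do here.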

Remark 16 We are aware of another concentration inequality for sampling without replacement,
which also applies to binary variables. This inequality, by Hush and Scovel
(2003), is an extension of a result by Vapnik (1998, Sec. 4.13). While Vapnik's result
concerns the case $m = N/2$ ($u = m$ in the transduction setup), the Hush and Scovel result
considers the general case of arbitrary $m$ and $N$. The transduction bound we obtain using
the Hush and Scovel concentration inequality is more complex but is qualitatively the same
as the bound of Corollary 23 that we later present. We therefore omit this bound.

4. PAC-Bayesian Transduction Bounds 

In this section we present general error bounds for transductive learning. Our bounds can be
viewed as extensions of McAllester's PAC-Bayesian inductive bounds (1999, 2003a, 2003b).
In Section 4.1 we focus on simple randomized learning algorithms which are typically referred
to as 'Gibbs algorithms'. Then in Section 4.2 we consider a standard deterministic
setting. In the case of binary classification the bounds for deterministic learning are comparable
to Vapnik's bounds presented in Section 2.2. Unlike the implicit but tight PAC-Bayesian
bound of Theorem 8 (and Corollary 9), the new bounds are somewhat looser but
explicit.

4.1 Bounds for Transductive Gibbs Learning 

We present two bounds. The first is a rather immediate extension of McAllester's bound 
(2003b, equation (6)), using a reduction to independence, as discussed in Section 3.1. The 
second bound is based on the 'direct approach' and is considerably tighter in many cases of 
interest. 

Within the original inductive setting, the selection of the prior distribution $p$ in the
PAC-Bayesian bounds must be made prior to observing the data. As we later show, in
the present transductive setting it is possible to obtain much more compact (and effective)
priors by first observing the full input sample $X_{m+u}$ and using it to construct a prior
$p = p(X_{m+u})$. However, as shown by McAllester (2003a), under certain conditions it is
possible to provide performance guarantees even if we select a "posterior" distribution over
the hypothesis space after observing the labels of the training points $X_m$. The guarantee is
provided for a Gibbs algorithm, which is simply a stochastic classifier defined as follows. Let
$q$ be any distribution over $\mathcal{H}$. The corresponding Gibbs classifier, denoted by $G_q$, classifies
any new instance using a randomly chosen hypothesis $h \in \mathcal{H}$, with $h \sim q$ (i.e. each new
instance is classified with a potentially new random classifier).

For Gibbs classifiers we now extend definition (1) as follows. Let $Z = x_1,\ldots,x_{|Z|}$ be
any set of samples and let $G_q$ be a Gibbs classifier over $\mathcal{H}$. The (expected) risk of $G_q$ over
$Z$ is

$$R_{G_q}(Z) \triangleq \mathbf{E}_{h \sim q}\left\{\frac{1}{|Z|}\sum_{i=1}^{|Z|} \ell(h(x_i),\phi(x_i))\right\}.$$

As before, when $Z = X_m$ (the training set) we use the standard notation $R_{G_q}(S_m) = R_{G_q}(X_m)$.
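
For a finite hypothesis class, the expected risk of a Gibbs classifier is simply the $q$-weighted average of the individual risks. A minimal Python sketch, with a hypothetical three-element class and toy data invented for the example:

```python
def empirical_risk(h, target, points):
    """Average 0/1 loss of a single hypothesis h on the given points."""
    return sum(h(x) != target(x) for x in points) / len(points)

def gibbs_risk(q, hypotheses, target, points):
    """Expected risk of the Gibbs classifier: the q-average of the individual risks."""
    return sum(w * empirical_risk(h, target, points)
               for w, h in zip(q, hypotheses))

points = [0.1, 0.2, 0.6, 0.8]
target = lambda x: 1 if x >= 0.5 else -1

hypotheses = [
    lambda x: 1 if x >= 0.5 else -1,    # agrees with the target: risk 0
    lambda x: 1 if x >= 0.15 else -1,   # errs on 0.2: risk 1/4
    lambda x: 1,                        # errs on 0.1 and 0.2: risk 1/2
]
q = [0.5, 0.25, 0.25]                   # a distribution over the three hypotheses

risk_gq = gibbs_risk(q, hypotheses, target, points)
```

Averaging over $q$ rather than sampling a hypothesis per instance gives the expectation directly, which is the quantity the bounds of this section control.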

The first risk bound we state for transductive Gibbs classifiers is a simple extension of the
recent inductive generalization bound for Gibbs classifiers presented by McAllester (2003b).
The new transductive bound relies on reduction to independence and its proof follows almost
exactly the proof of the original inductive result. We therefore omit the proof but note that
the inductive bound relies on the variant of Theorem 13 (inequality (10)) for sampling
with replacement. The new bound is obtained by bounding the divergence between
$R_h(X_{m+u})$ and $R_h(S_m)$, now relying on inequality (10), which concerns sampling without
replacement. The bound on the transductive risk $R_h(X_u)$ is obtained using the following
simple generalization of Lemma 5, stating that for all $q$ and $C$,

$$R_{G_q}(X_{m+u}) \le R_{G_q}(S_m) + C \quad\Leftrightarrow\quad R_{G_q}(X_u) \le R_{G_q}(S_m) + \frac{m+u}{u}\cdot C. \qquad (13)$$

Theorem 17 (Gibbs Classifiers) Let $X_{m+u} = X_m \cup X_u$ be the full sample. Let $p = p(X_{m+u})$
be a (prior) distribution over $\mathcal{H}$ that may depend on the full sample. Let $\delta \in (0,1)$
be given. Then with probability at least $1-\delta$ over the choices of $S_m$ (from the full sample)
the following bound holds for any distribution $q$,

$$R_{G_q}(X_u) \le R_{G_q}(S_m) + \left(\frac{m+u}{u}\right)\left(\sqrt{\frac{2R_{G_q}(S_m)\left(D(q\|p) + \ln\frac{m}{\delta}\right)}{m-1}} + \frac{2\left(D(q\|p) + \ln\frac{m}{\delta}\right)}{m-1}\right), \qquad (14)$$

where $D(\cdot\|\cdot)$ is the familiar Kullback-Leibler (KL) divergence (see e.g. Cover & Thomas,
1991).

Notice that when Rq (S m ) = (the so-called "realizable case") fast convergence rates of 
order 1/m are possible when u is sufficiently large (i.e. u = w(m)). 
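To make the dependence on $m$, $u$ and the KL term concrete, the following sketch evaluates the right-hand side of (14), read here as the empirical risk plus $\frac{m+u}{u}$ times a square-root term and a linear term in $(D(q\|p) + \ln(m/\delta))/(m-1)$. The function names and the restriction to discrete distributions are assumptions made for illustration.

```python
import math


def kl_divergence(q, p):
    """KL divergence D(q||p) in nats for discrete distributions on a common support.

    Assumes p[i] > 0 wherever q[i] > 0; terms with q[i] == 0 contribute zero.
    """
    return sum(q_i * math.log(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0.0)


def gibbs_bound_thm17(emp_risk, kl, m, u, delta):
    """Right-hand side of (14) for training size m and test size u."""
    c = (kl + math.log(m / delta)) / (m - 1)
    return emp_risk + (m + u) / u * (math.sqrt(2.0 * emp_risk * c) + 2.0 * c)


# Hypothetical prior/posterior over four hypotheses.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.7, 0.1, 0.1, 0.1]
kl = kl_divergence(posterior, prior)
print(gibbs_bound_thm17(emp_risk=0.0, kl=kl, m=1000, u=10000, delta=0.05))
```

In the realizable case the square-root term vanishes, so the value printed above scales like $\frac{m+u}{u} \cdot \frac{1}{m}$ up to the logarithmic factor.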

The next risk bound we present for transductive Gibbs binary classifiers relies on the
"direct" concentration inequality of Theorem 15, for sampling without replacement. The
proof is based on the proof technique recently presented by McAllester (2003b), which, in
turn, is based on the results of Langford and Shawe-Taylor (2002) and Seeger (2003).

Theorem 18 (Binary Gibbs Classifiers) Let the conditions of Theorem 17 hold, and
assume the loss is binary. Then with probability at least $1 - \delta$ over the choices of $S_m$ (from
the full sample) the following bound holds for any distribution $q$,

$$R_{G_q}(X_u) \le \hat{R}_{G_q}(S_m) + \sqrt{ \left( \frac{2 \hat{R}_{G_q}(S_m)(m+u)}{u} \right) \frac{D(q\|p) + \ln\frac{m}{\delta} + 7\log(m+u+1)}{m-1} } + \frac{2 \left( D(q\|p) + \ln\frac{m}{\delta} + 7\log(m+u+1) \right)}{m-1}.$$

Before we prove Theorem 18 observe that when $\hat{R}_{G_q}(S_m) = 0$ (the "realizable case") the
bound converges even if $u = 1$. In contrast, the bound of Theorem 17 diverges in the
realizable case for any $u = O(\sqrt{m})$.
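The contrast between the two bounds in the realizable case can be checked numerically. The sketch below assumes the forms of the two bounds as stated in Theorems 17 and 18, fixes $u = 1$ and $D(q\|p) = 0$, and uses natural logarithms throughout; the function names are illustrative.

```python
import math


def realizable_slack_thm17(kl, m, u, delta):
    """Slack of Theorem 17 when the empirical Gibbs risk is zero."""
    return (m + u) / u * 2.0 * (kl + math.log(m / delta)) / (m - 1)


def gibbs_bound_thm18(emp_risk, kl, m, u, delta):
    """Right-hand side of the bound of Theorem 18 (binary loss)."""
    k = kl + math.log(m / delta) + 7.0 * math.log(m + u + 1)
    return (emp_risk
            + math.sqrt(2.0 * emp_risk * (m + u) / u * k / (m - 1))
            + 2.0 * k / (m - 1))


# With a single test point, Theorem 17's slack keeps growing with m,
# whereas the bound of Theorem 18 shrinks.
for m in (10**3, 10**5, 10**7):
    print(m,
          realizable_slack_thm17(0.0, m, 1, 0.05),
          gibbs_bound_thm18(0.0, 0.0, m, 1, 0.05))
```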

For proving Theorem 18 we quote without proof two results by McAllester (2003b).

Lemma 19 (Lemma 5, McAllester, 2003b) Let $X$ be a random variable satisfying
$\Pr\{X > x\} \le e^{-m f(x)}$ where $f(x)$ is non-negative. Then $\mathbf{E}\left[ e^{(m-1) f(X)} \right] \le m$.

Lemma 20 (Lemma 8, McAllester, 2003b) $\mathbf{E}_{x \sim q}[f(x)] \le D(q\|p) + \ln \mathbf{E}_{x \sim p}\, e^{f(x)}$.

Proof of Theorem 18: The proof is based on the ideas by McAllester (2003b). Define

$$\nu_h = \hat{R}_h(S_m); \qquad \mu_h = R_h(X_{m+u}).$$

Let

$$f_h(\nu) = D(\nu\|\mu_h) + \frac{u}{m} D\left( \frac{\mu_h - \beta\nu}{1-\beta} \,\Big\|\, \mu_h \right) - \frac{7}{m} \log(m+u+1).$$

From Lemma 20 we have that

$$\mathbf{E}_{h \sim q}\left[ (m-1) f_h(\nu_h) \right] \le D(q\|p) + \ln \mathbf{E}_{h \sim p}\, e^{(m-1) f_h(\nu_h)}. \qquad (15)$$

An upper bound on $\mathbf{E}_{h \sim p}\left[ e^{(m-1) f_h(\nu_h)} \right]$ may be obtained by the following argument. From
Theorem 15 we have that

$$\Pr\{\nu_h > \nu\} \le \exp\{-m f_h(\nu)\}.$$

Lemma 19 then implies that for any $h$

$$\mathbf{E}_{S_m}\, e^{(m-1) f_h(\nu_h)} \le m,$$

which implies that

$$\mathbf{E}_{S_m} \mathbf{E}_{h \sim p}\, e^{(m-1) f_h(\nu_h)} \le m,$$

from which we infer that with probability at least $1 - \delta$,

$$\mathbf{E}_{h \sim p}\, e^{(m-1) f_h(\nu_h)} \le \frac{m}{\delta}$$

by using Markov's inequality. Substituting in (15) we find that with probability at least
$1 - \delta$,

$$\mathbf{E}_{h \sim q}\left[ (m-1) f_h(\nu_h) \right] \le D(q\|p) + \ln\frac{m}{\delta}. \qquad (16)$$

Substituting for $f_h(\nu_h)$, and using the convexity of the function $x \log x$, we find that

$$D\left( \hat{R}_{G_q}(S_m) \,\|\, R_{G_q}(X_{m+u}) \right) + \frac{u}{m} D\left( \frac{R_{G_q}(X_{m+u}) - \beta \hat{R}_{G_q}(S_m)}{1-\beta} \,\Big\|\, R_{G_q}(X_{m+u}) \right) - \frac{7}{m}\log(m+u+1) \le \frac{D(q\|p) + \ln\frac{m}{\delta}}{m-1}. \qquad (17)$$

In order to obtain an explicit bound, we use the inequality

$$D(\nu\|\mu) \ge \frac{(\nu - \mu)^2}{2\mu},$$

and substitute this in (17) obtaining

$$R_{G_q}(X_{m+u}) \le \hat{R}_{G_q}(S_m) + \sqrt{ R_{G_q}(X_{m+u}) \left( \frac{2u}{m+u} \right) \frac{D(q\|p) + \ln\frac{m}{\delta} + 7\log(m+u+1)}{m-1} }.$$

Thus we have (with probability at least $1 - \delta$), $z \le a + \sqrt{zb}$ (where $z = R_{G_q}(X_{m+u})$).
Solving for $z$ we get $z \le a + b + \sqrt{ab}$, and using Lemma 5 yields the desired result. □
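The last step, from $z \le a + \sqrt{zb}$ to $z \le a + b + \sqrt{ab}$, is stated without detail. A short derivation for $a, b \ge 0$, spelled out here as a reader's aid rather than as part of the original proof, is:

```latex
\begin{align*}
z \le a + \sqrt{zb}
 &\;\Longrightarrow\; \sqrt{z} \le \tfrac{1}{2}\left(\sqrt{b} + \sqrt{b + 4a}\right)
   && \text{(quadratic inequality in } \sqrt{z}\text{)} \\
 &\;\Longrightarrow\; z \le a + \tfrac{b}{2} + \tfrac{1}{2}\sqrt{b^2 + 4ab}
   && \text{(squaring both sides)} \\
 &\;\Longrightarrow\; z \le a + \tfrac{b}{2} + \tfrac{1}{2}\left(b + 2\sqrt{ab}\right) = a + b + \sqrt{ab}
   && \text{(since } \sqrt{b^2 + 4ab} \le b + 2\sqrt{ab}\text{)}.
\end{align*}
```

Here $a$ plays the role of the empirical Gibbs risk and $b$ is the factor multiplying $z$ under the square root.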

Remark 21 It is interesting to compare the bound of Theorem 17, based on the reduction to
independence approach, and that of Theorem 18 which is based on a direct concentration
inequality for sampling without replacement. The complexity term in Theorem 17 is multiplied
by $(m+u)/u$, while the corresponding term in Theorem 18 is multiplied by $\sqrt{(m+u)/u}$.
This clearly displays the advantage of using the direct concentration bound, even though it
does not lead to improved convergence rates in general. More importantly, for the realizable
case, $\hat{R}_{G_q}(S_m) = 0$, the bound of Theorem 18 converges to zero even for $u = 1$. This is not
the case for the bound of Theorem 17.

4.2 Bounds for Deterministic Learning Algorithms 

In this section we present three transductive PAC-Bayesian error bounds for deterministic
learning algorithms. Note that the two bounds we present for the (stochastic) Gibbs
algorithms in the previous subsection can be specialized to deterministic algorithms. This
is done by choosing a "posterior" $q$ which assigns probability 1 to one desired hypothesis
$h \in \mathcal{H}$. In doing so the term $D(q\|p)$ reduces to $\log(1/p(h))$. For example, the bound of
Theorem 17 reduces to



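$$R_h(X_u) \le \hat{R}_h(S_m) + \left( \frac{m+u}{u} \right) \left( \sqrt{ \frac{2 \hat{R}_h(S_m) \left( \log\frac{1}{p(h)} + \ln\frac{m}{\delta} \right)}{m-1} } + \frac{2 \left( \log\frac{1}{p(h)} + \ln\frac{m}{\delta} \right)}{m-1} \right), \qquad (18)$$

which applies to any bounded loss function.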

The following bound relies on the Serfling concentration inequality presented in
Theorem 14 and applies to any bounded loss function. 10

Theorem 22 Let $X_{m+u} = X_m \cup X_u$ be the full sample and let $p = p(X_{m+u})$ be a (prior)
distribution over $\mathcal{H}$ that may depend on the full sample. Assume that $\ell(h(x), y) \in [0, B]$
and let $\delta \in (0, 1)$ be given. Then, with probability at least $1 - \delta$ over choices of $S_m$ (from
the full sample) the following bound holds for any $h \in \mathcal{H}$,

$$R_h(X_u) \le \hat{R}_h(S_m) + B \sqrt{ \left( \frac{m+u}{u} \right) \left( \frac{u+1}{u} \right) \left( \frac{\ln\frac{1}{p(h)} + \ln\frac{1}{\delta}}{2m} \right) }. \qquad (19)$$

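Proof In our transduction setting the set $X_m$ (and therefore $S_m$) is obtained by sampling
the full sample $X_{m+u}$ uniformly at random without replacement. It is not hard to see that
$\mathbf{E}_{S_m} \hat{R}_h(S_m) = R_h(X_{m+u})$. Specifically,

$$\mathbf{E}_{S_m} \hat{R}_h(S_m) = \frac{1}{\binom{m+u}{m}} \sum_{S_m} \hat{R}_h(S_m) = \frac{1}{\binom{m+u}{m}} \sum_{X_m \subseteq X_{m+u}} \frac{1}{m} \sum_{x \in X_m} \ell(h(x), \phi(x)), \qquad (20)$$

where $\phi(x)$ denotes the label of $x$. By symmetry, all points $x \in X_{m+u}$ are counted on the right-hand side an equal number
of times; this number is precisely $\binom{m+u}{m} - \binom{m+u-1}{m} = \binom{m+u-1}{m-1}$. The result is obtained by
considering the definition of $R_h(X_{m+u})$ and noting that $\binom{m+u-1}{m-1} / \binom{m+u}{m} = \frac{m}{m+u}$. Using the
fact that our loss function is bounded in $[0, B]$ we apply Theorem 14 (for a fixed $h$ and
$N = m + u$),

$$\Pr_{S_m} \left\{ R_h(X_{m+u}) - \hat{R}_h(S_m) > \epsilon \right\} \le \exp\left\{ -\frac{2m\epsilon^2}{B^2} \left( \frac{m+u}{u+1} \right) \right\}. \qquad (21)$$

Setting $\epsilon(h) = B \sqrt{ \left( \frac{u+1}{m+u} \right) \left( \frac{\ln\frac{1}{p(h)} + \ln\frac{1}{\delta}}{2m} \right) }$ and using the union bound we find

$$\Pr_{S_m} \left\{ \exists h \in \mathcal{H} \text{ s.t. } R_h(X_{m+u}) - \hat{R}_h(S_m) > \epsilon(h) \right\} \le \sum_h \exp\left\{ -\frac{2m\epsilon(h)^2}{B^2} \left( \frac{m+u}{u+1} \right) \right\} = \sum_h p(h)\delta = \delta.$$

We thus obtain that

$$R_h(X_{m+u}) \le \hat{R}_h(S_m) + B \sqrt{ \left( \frac{u+1}{m+u} \right) \left( \frac{\ln\frac{1}{p(h)} + \ln\frac{1}{\delta}}{2m} \right) }. \qquad (22)$$

The proof is then completed using Lemma 5.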
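Two steps of this proof lend themselves to a numerical sanity check on a tiny example: the identity $\mathbf{E}_{S_m} \hat{R}_h(S_m) = R_h(X_{m+u})$ and the tail inequality (21). The sketch below enumerates all training subsets exhaustively; the loss vector is hypothetical, and the helper names are assumptions introduced for illustration.

```python
import math
from fractions import Fraction
from itertools import combinations


def serfling_tail_bound(eps, m, u, B=1.0):
    """Right-hand side of (21): exp(-(2 m eps^2 / B^2) (m + u) / (u + 1))."""
    return math.exp(-2.0 * m * eps ** 2 / B ** 2 * (m + u) / (u + 1))


def serfling_slack(p_h, m, u, delta, B=1.0):
    """Complexity term of (19) for a hypothesis with prior mass p_h."""
    log_term = math.log(1.0 / p_h) + math.log(1.0 / delta)
    return B * math.sqrt((m + u) / u * (u + 1) / u * log_term / (2.0 * m))


# Hypothetical 0/1 losses of one fixed hypothesis on a full sample of 8 points.
losses = [0, 1, 0, 0, 1, 0, 0, 1]
m, u = 4, 4
full_risk = Fraction(sum(losses), len(losses))

# Every subset of m positions is an equally likely training sample.
train_risks = [Fraction(sum(subset), m) for subset in combinations(losses, m)]
mean_train_risk = sum(train_risks) / len(train_risks)

eps = 0.25
tail = sum(full_risk - r > eps for r in train_risks) / len(train_risks)

print(mean_train_risk == full_risk)          # identity (20), checked exactly
print(tail, serfling_tail_bound(eps, m, u))  # empirical tail vs. bound (21)
```

The choice of $\epsilon(h)$ in the proof is calibrated so that the tail bound evaluated at $\epsilon(h)$ equals $p(h)\delta$; multiplying $\epsilon(h)$ by $(m+u)/u$ gives the slack in (19).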



For classification using the 0/1 loss function we present one bound, which is a
specialization of Theorem 18.



10. Although we have not utilized the Serfling inequality for devising a bound for the Gibbs algorithm, it 
can be done as well. 
Corollary 23 Let the conditions of Theorem 22 hold and assume the loss is binary. Then
with probability at least $1 - \delta$ over the choices of $S_m$ (from the full sample) the following
bound holds for any $h \in \mathcal{H}$,

$$R_h(X_u) \le \hat{R}_h(S_m) + \sqrt{ \left( \frac{2 \hat{R}_h(S_m)(m+u)}{u} \right) \frac{\log\frac{1}{p(h)} + \ln\frac{m}{\delta} + 7\log(m+u+1)}{m-1} } + \frac{2 \left( \log\frac{1}{p(h)} + \ln\frac{m}{\delta} + 7\log(m+u+1) \right)}{m-1}.$$

Figures 1 and 2 compare the two bounds presented in this section with Vapnik's bound
of Corollary 9. Throughout the discussion here the bound of Theorem 22 is referred to as
the "Serfling bound". Figure 1 focuses on the realizable case (i.e. empirical error = 0).
According to the statements of Theorem 22 and Corollary 23, the Serfling bound has a
significantly slower rate of convergence in the realizable case. However, the constants (and
logarithmic terms) are larger in the bound of Corollary 23. Panels (a) and (b) in Figure 1
indicate that the Serfling bound is significantly better than the bound of Corollary 23 when
$u = \Omega(m)$ for the range of $m$ we consider. However, even in these cases, we know that
the bound of Corollary 23 will eventually outperform the Serfling bound. We also see that
the Serfling bound tracks Vapnik's bound quite well when $u = O(m)$. On the other hand,
Panels (c) and (d) indicate that the bound of Corollary 23 is significantly better than the
Serfling bound when $u = o(m)$. The examples given are $u = \sqrt{m}$ in Panel (c) and $u = 10$
in Panel (d). Figure 2 shows these bounds for the case $\hat{R}_h(S_m) = 0.2$. Here again the
Serfling bound nicely tracks Vapnik's bound and we see that the bound of Corollary 23
converges much more slowly. All the curves in Figures 1 and 2 consider the case $p(h) = 1$.
This assignment of the prior eliminates the influence of the union bound that is used to
derive these bounds. In Figure 3 we show, for both the Vapnik and Serfling bounds, the
complexity term as a function of the prior $p(h)$, with $0.01 \le p(h) \le 1$. Note that such prior
assignments are realistic in the case of the transduction algorithm based on clustering that
is introduced in Section 5.2.

While these plots indicate that our bounds approximate Vapnik's bound quite well in
many cases of interest, we also see that for small values of $m$ (similar to those considered
in the plots) one will gain in applications by using the implicit but tighter Vapnik bound
(or the Blum and Langford, 2003, bound).

5. Bounds for Specific Algorithms 

PAC-Bayesian error bounds (both inductive and transductive) are interesting because they 
provide a very simple yet general formulation of learning. However, in order to provide more 
concrete statements (e.g. about specific learning algorithms) one must apply such bounds 
with some concrete priors (and posteriors, in the case of Gibbs learning, see Theorems 17 
and 18). In the context of inductive learning, a major obstacle in deriving effective bounds 11 
using the PAC-Bayesian framework is the construction of "compact priors" . For example, 



11. Informally, we say that a bound is "effective" if its complexity term vanishes with m (the size of the 
training sample) and it is sufficiently small for "reasonable" values m. 
Figure 1: A comparison of Vapnik's bound (Corollary 9), the bound of Theorem 22 (denoted
here by "Serfling bound") and the bound of Corollary 23. All bounds assume zero
empirical error, $\delta = 0.01$ and $p(h) = 1$. (a) $u = 10m$; (b) $u = m$; (c) $u = \sqrt{m}$;
and (d) $u = 10$.



McAllester's generalization bound (McAllester, 1999) contains a complexity term which
includes a component of the form $\ln(1/p(h))$ where $p$ is a prior over $\mathcal{H}$ (as in Theorem 22 and
Corollary 23). The more sophisticated inductive bounds for Gibbs classifiers (McAllester,
2003a, 2003b) include a Kullback-Leibler (KL) divergence complexity component $D(q\|p)$,
where $p$ is a prior over $\mathcal{H}$ and $q$ is a posterior over $\mathcal{H}$ (as in Theorems 17 and 18). However,
many hypothesis classes of interest are very large and even uncountably infinite. Therefore,
despite the fact that these bounds apply in principle to very large $\mathcal{H}$, in a straightforward
application of these PAC-Bayesian bounds, when choosing priors with a very large support
Figure 2: A comparison of Vapnik's bound (Corollary 9), the bound of Theorem 22 (denoted
here by "Serfling bound") and the bound of Corollary 23. All bounds assume
empirical error of 0.2, $\delta = 0.01$ and $p(h) = 1$. (a) $u = 10m$; (b) $u = m$.

Figure 3: The complexity term of the Vapnik and Serfling bounds as a function of $p(h)$
with $0.01 \le p(h) \le 1$, $\delta = 0.01$ and $m = u = 50$.



(and possibly a posterior with a small support), the complexity terms in these bounds can 
diverge or at least be too large to form effective generalization bounds. 12 



12. Having said that, we should also note that sophisticated prior choices within the inductive PAC-Bayesian
framework can also lead to state-of-the-art bounds (McAllester, 2003b).
In contrast, the transductive framework provides a very convenient setting for applying
PAC-Bayesian bounds. Here priors can be chosen after observing and analyzing the full
sample. As already mentioned, in the case of the 0/1-loss, even if we consider a very large
hypothesis space $\mathcal{H}$, after observing the full sample, the effective size of equivalence classes
of hypotheses in $\mathcal{H}$ is always finite and not larger than the number of dichotomies of the
full sample (see also Remark 7).

In this section we present bounds for specific learning algorithms. The first class of
algorithms, considered in Section 5.1, consists of "compression schemes". The second class,
considered in Section 5.2, consists of algorithms based on clustering. In both cases we show
how to form effective priors by considering the structure of the full sample.

5.1 Bounds for Compression Algorithms 

We propose a technique for selecting a prior $p(h)$ over $\mathcal{H}$, based on the full (unlabeled)
sample $X_{m+u}$. Given $m$, the learner constructs $m$ "sub-priors" $p_\tau$, $\tau = 1, 2, \ldots, m$, based
on the full sample $X_{m+u}$, and for the final prior, takes a uniform mixture of all these
"sub-priors".

This technique generates transductive error bounds for "compression" algorithms. Let
$A$ be a learning algorithm. Intuitively, $A$ is a "compression scheme" if it outputs the same
hypothesis using a subset of the (labeled) training data.

Definition 24 A learning algorithm $A$ (viewed as a function from samples to some hypothesis
class) is a compression scheme with respect to a sample $Z$ if there is a sub-sample $Z'$,
$|Z'| < |Z|$, such that $A(Z') = A(Z)$.

Observe that the Support Vector Machine (SVM) approach is a compression scheme, where
the set $Z'$ is determined by the set of support vectors.

Let $A$ be a deterministic compression scheme and consider the full sample $X_{m+u}$. For
each integer $\tau = 1, \ldots, m$, consider all subsets of $X_{m+u}$ of size $\tau$, and for each subset construct
all possible dichotomies of that subset (note that we are not proposing this approach
as a useful algorithm, but rather as a means to derive bounds; in practice one need not
construct all these dichotomies). A deterministic algorithm $A$ generates at most one hypothesis
$h \in \mathcal{H}$ for each dichotomy. 13 For each $\tau$, let the set of hypotheses generated by
this procedure be denoted by $\mathcal{H}_\tau$. For the rest of this discussion we assume the worst case
where $|\mathcal{H}_\tau| = 2^\tau \binom{m+u}{\tau}$ (i.e. if $\mathcal{H}_\tau$ does not contain one hypothesis for each dichotomy the
bounds we propose below improve). The "sub-prior" $p_\tau$ is then defined to be a uniform
distribution over $\mathcal{H}_\tau$.

In this way we have $m$ "sub-priors", $p_1, \ldots, p_m$, which are constructed using only $X_{m+u}$
(and are independent of the labels of the training set $Y_m$; also note that this construction
takes place before choosing the subset $X_m$). Any hypothesis selected by the learning algorithm
$A$ based on the labeled sample $S_m$ and on the test set $X_u$ belongs to $\bigcup_{\tau=1}^{m} \mathcal{H}_\tau$. The
motivation for this construction is as follows. Each $\tau$ can be viewed as our "guess" for the
maximal number of compression points that will be utilized by a resulting classifier. For



13. It might be that for some dichotomies the learning algorithm will fail to construct a classifier. For
example, a linear SVM in feature space without "soft margin" will fail to classify non linearly-separable
dichotomies of $X_{m+u}$.

each such $\tau$ the distribution $p_\tau$ is constructed over all possible classifiers that use $\tau$ compression
points. By systematically considering all possible dichotomies of $\tau$ points we can
characterize a relatively small subset of $\mathcal{H}$ without observing labels of the training points.
Thus, each "sub-prior" $p_\tau$ represents one such guess. The final prior is

$$p(h) = \frac{1}{m} \sum_{\tau=1}^{m} p_\tau(h). \qquad (23)$$

The following corollary is obtained by an application of Theorem 22 using the prior
$p(h)$ in (23). This result characterizes an upper bound on the divergence in terms of the
observed size of the compression set of the final classifier.

Corollary 25 (Transductive Compression Bound) Let the conditions of Theorem 22
hold. Let $A$ be a deterministic learning algorithm leading to a hypothesis $h \in \mathcal{H}$ based on a
compression set of size $s$. Then, with probability at least $1 - \delta$,

$$R_h(X_u) \le \hat{R}_h(S_m) + \sqrt{ \left( \frac{m+u}{u} \right) \left( \frac{u+1}{u} \right) \left( \frac{s \ln\left( \frac{2e(m+u)}{s} \right) + \ln(m/\delta)}{2m} \right) }. \qquad (24)$$

Proof Recall that $\mathcal{H}_s \subseteq \mathcal{H}$ is the support set of $p_s$ and that $p_s(h) = 1/|\mathcal{H}_s|$ for all $h \in \mathcal{H}_s$,
implying that $\ln(1/p_s(h)) = \ln|\mathcal{H}_s|$. Using the inequality $\binom{m+u}{s} \le (e(m+u)/s)^s$ we have
that $|\mathcal{H}_s| = 2^s \binom{m+u}{s} \le (2e(m+u)/s)^s$. Using the prior (23), for which $p(h) \ge p_s(h)/m$, and
substituting $\ln(m/p_s(h))$ for $\ln(1/p(h))$ in Theorem 22 leads to the desired result. ■
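As the proof indicates, the compression bound depends on the data only through the training error and the size $s$ of the compression set, so it is cheap to evaluate. The sketch below assumes the form of (24) with the 0/1 loss and a $2m$ denominator inherited from Theorem 22; the function names are illustrative. The second helper computes the exact value of $\ln(m|\mathcal{H}_s|)$ that the proof upper-bounds.

```python
import math


def compression_bound(emp_risk, s, m, u, delta):
    """Right-hand side of (24) for a compression set of size s (0/1 loss)."""
    complexity = s * math.log(2.0 * math.e * (m + u) / s) + math.log(m / delta)
    return emp_risk + math.sqrt((m + u) / u * (u + 1) / u * complexity / (2.0 * m))


def log_inverse_prior(s, m, u):
    """ln(m * |H_s|) with |H_s| = 2^s * C(m+u, s), the worst case for prior (23)."""
    return math.log(m) + s * math.log(2.0) + math.log(math.comb(m + u, s))


# Hypothetical setting: 200 training points, 200 test points.
m, u, delta = 200, 200, 0.05
for s in (5, 20, 50):
    print(s, compression_bound(0.1, s, m, u, delta), log_inverse_prior(s, m, u))
```

Fewer compression points (for an SVM, fewer support vectors) give a smaller complexity term.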



Remark 26 We can use Corollary 23 to get a similar result, which is sometimes tighter. 
Also note that compression bounds can be easily stated and proved for Gibbs learning. 

The bound (24) can be easily computed once the classifier is trained. If the size of the 
compression set happens to be small, we obtain a tight bound. We note that these bounds 
are applicable to the transductive SVM algorithms discussed by Vapnik (1998), Bennett and 
Demiriz (1998) and Joachims (1999). However, our bounds motivate a different strategy 
than the one that drives these algorithms; namely, reduce the number of support vectors! 
(rather than enlarge the margin, as attempted by those algorithms). 

Observe the conceptual similarity of our bound to Vapnik's bound for consistent SVMs 
(Vapnik, 1995, Theorem 5.2), which bounds the generalization error of an SVM by the 
ratio between the average number of support vectors and the sample size m. However, 
Vapnik's bound can only be estimated while this bound is truly data dependent. Finally, it 
is interesting to compare our result to a recent inductive bound for compression schemes. 
In this context, Graepel et al. (see Theorem 5.18 in Herbrich, 2002) have derived a bound 
of the form 



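$$R(\mathrm{SVM}) \le \frac{m}{m-s} \hat{R}(\mathrm{SVM}) + \sqrt{ \frac{s \log(2em/s) + \ln(1/\delta) + 2\ln m}{2(m-s)} }, \qquad (25)$$

where $R(\mathrm{SVM})$ and $\hat{R}(\mathrm{SVM})$ denote the true and empirical errors, respectively, and $s$ is the
number of observed support vectors (over the training set).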
5.2 Transductive Learning via Clustering

Some learning problems do not allow for high compression rates using compression schemes
such as SVMs (i.e. the number of support vectors can sometimes be very large, see e.g.
Baram et al., 2004, Table 1). A considerably stronger type of compression can often be
achieved by clustering algorithms. While there is a lack of formal links between entirely
unsupervised clustering and classification, within a transductive setting we can provide a
principled approach to using clustering algorithms for classification.

In particular, we propose the following approach: The learner applies a clustering algorithm
(or a number of clustering algorithms) over the unlabeled data to generate several
(unsupervised) models. The learner then utilizes the labeled data to guess labels for entire
clusters (so that all points in the same cluster have the same label). In this way, a number
of hypotheses are generated, one of which is then selected based on a PAC-Bayesian error
bound for transduction, applied with an appropriate prior.

The natural idea of first clustering the unlabeled data and then assigning labels to
clusters has been around for a long time and there are plenty of heuristic procedures that
attempt to learn using this approach within semi-supervised or transductive settings
(see, e.g., Seeger, 2002, Sec. 2.1). However, to the best of our knowledge, none of the
existing procedures was theoretically justified in terms of provable reduction of the true
risk. In contrast, the clustering-based transduction method we propose here relies on
solid theoretical ground.

Let $A$ be any (deterministic) clustering algorithm which, given the full sample $X_{m+u}$,
can cluster this sample into any desired number of clusters. We use $A$ to cluster $X_{m+u}$ into
$1, \ldots, c$ clusters where $c < m$. Thus, the algorithm generates a collection of partitions of
$X_{m+u}$ into $\tau = 1, 2, \ldots, c$ clusters, where each partition is denoted by $C_\tau$. For each value
of $\tau$, let $\mathcal{H}_\tau$ consist of those hypotheses which assign an identical label to all points in the
same cluster of partition $C_\tau$, and define the "sub-prior" $p_\tau(h) = 1/2^\tau$ for each $h \in \mathcal{H}_\tau$ and
zero otherwise (note that there are $2^\tau$ possible dichotomies). The final prior is a uniform
mixture of all these sub-priors, $p(h) = \frac{1}{c} \sum_{\tau=1}^{c} p_\tau(h)$.

The learning algorithm selects a hypothesis as follows. Upon observing the labeled
sample $S_m = (X_m, Y_m)$, for each of the clusterings $C_1, \ldots, C_c$ constructed above, it assigns
a label to each cluster based on the majority vote from the labels $Y_m$ of points falling within
the cluster (in case of ties, or if no points from $X_m$ belong to the cluster, choose a label
arbitrarily). Doing this leads to $c$ classifiers $h_\tau$, $\tau = 1, \ldots, c$. We now apply Theorem 22
with the above prior $p(h)$ and can choose the classifier (equivalently, number of clusters)
for which the best bound holds. We thus have the following corollary of Theorem 22.

Corollary 27 Let $A$ be any clustering algorithm and let $h_\tau$, $\tau = 1, \ldots, c$ be classifications
of the test set $X_u$ as determined by clustering of the full sample $X_{m+u}$ (into $\tau$ clusters) and
the training set $S_m$, as described above. Let $\delta \in (0, 1)$ be given. Then, with probability at
least $1 - \delta$ over choices of $S_m$ the following bound holds for all $\tau$,

$$R_{h_\tau}(X_u) \le \hat{R}_{h_\tau}(S_m) + \sqrt{ \left( \frac{m+u}{u} \right) \left( \frac{u+1}{u} \right) \left( \frac{\tau + \ln\frac{c}{\delta}}{2m} \right) }. \qquad (26)$$
Remark: Note that when $m = u$ we get the bound

$$R_{h_\tau}(X_u) \le \hat{R}_{h_\tau}(S_m) + \sqrt{ \frac{\left( 1 + \frac{1}{m} \right) \left( \tau + \ln\frac{c}{\delta} \right)}{m} }.$$

Also, in the case of the 0/1 loss we can use Corollary 23 to obtain significantly faster rates
in the realizable case or when the training error is very small. We see that these bounds can
be rather tight when the clustering algorithm is successful (e.g. when it captures the class
structure in the data using a small number of clusters). Note however, that in practice one
can significantly benefit from the faster rates that can be achieved utilizing Vapnik's implicit
bounds presented in Section 2.2 (or the bound of Blum and Langford, 2003). Clearly, any
PAC-Bayesian bound for transduction can be plugged into this scheme and tighter
bounds should yield better performance.
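The selection procedure behind Corollary 27 can be sketched as follows. The clusterings are assumed to be supplied by the caller (any clustering algorithm run on the full, unlabeled sample); the function names, the input layout and the rule of assigning label 0 to clusters without training points are illustrative assumptions, the last one being a particular instance of the arbitrary choice the text allows.

```python
import math
from collections import Counter


def bound_26(emp_risk, tau, c, m, u, delta):
    """Right-hand side of (26) for the classifier built from tau clusters."""
    return emp_risk + math.sqrt((m + u) / u * (u + 1) / u
                                * (tau + math.log(c / delta)) / (2.0 * m))


def transduce_by_clustering(clusterings, train_idx, train_labels, delta=0.05):
    """Pick the clustering whose majority-vote classifier has the smallest bound.

    clusterings[t] lists, for every point of the full sample, its cluster id in
    the partition into t + 1 clusters. Returns (bound, tau, labels), where labels
    covers the full sample (training and test points).
    """
    c = len(clusterings)
    m = len(train_idx)
    u = len(clusterings[0]) - m
    best = None
    for tau, assignment in enumerate(clusterings, start=1):
        votes = {}
        for i, y in zip(train_idx, train_labels):
            votes.setdefault(assignment[i], Counter())[y] += 1
        # Majority vote per cluster; ties are broken by first occurrence.
        cluster_label = {cid: cnt.most_common(1)[0][0] for cid, cnt in votes.items()}
        # Clusters with no training point get label 0 (an arbitrary choice).
        labels = [cluster_label.get(cid, 0) for cid in assignment]
        emp_risk = sum(labels[i] != y for i, y in zip(train_idx, train_labels)) / m
        candidate = (bound_26(emp_risk, tau, c, m, u, delta), tau, labels)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


# Hypothetical full sample of 8 points, clustered into 1 and into 2 clusters.
clusterings = [[0] * 8, [0, 0, 0, 0, 1, 1, 1, 1]]
train_idx, train_labels = [0, 1, 4, 5], [0, 0, 1, 1]
bound, tau, labels = transduce_by_clustering(clusterings, train_idx, train_labels)
print(tau, labels, bound)
```

On a sample this small the bound itself is vacuous (larger than 1); the example only illustrates how the trade-off between training error and the number of clusters $\tau$ drives the selection.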

Corollary 27 can be extended in a number of ways. One simple extension is the use of
an ensemble of clustering algorithms. Specifically, we can concurrently apply $k$ different
clustering algorithms (using each algorithm to cluster the data into $\tau = 1, \ldots, c$ clusters).
We thus obtain $kc$ hypotheses (partitions of $X_{m+u}$). By a simple application of the union
bound we can replace $\ln\frac{c}{\delta}$ by $\ln\frac{kc}{\delta}$ in Corollary 27 and guarantee that $kc$ bounds hold
simultaneously for all the $k$ clustering algorithms (with probability at least $1 - \delta$). We thus
choose the hypothesis which minimizes the resulting bound. 14 This extension is particularly
attractive since typically without prior knowledge we do not know which clustering
algorithm will be effective for the dataset at hand.
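The change in the logarithmic term can be seen by a short calculation. The sketch below assumes, as one natural reading of the union-bound argument, that Corollary 27 is applied to each of the $k$ algorithms with confidence parameter $\delta/k$:

```latex
\begin{align*}
\Pr\{\text{at least one of the } k \text{ applications of Corollary 27 fails}\}
  &\le \sum_{j=1}^{k} \frac{\delta}{k} = \delta, \\
\tau + \ln\frac{c}{\delta/k} &= \tau + \ln\frac{kc}{\delta}.
\end{align*}
```

The price of using $k$ algorithms is thus an additive $\ln k$ in the complexity term.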

To conclude this section we note that El-Yaniv and Gerzon (2004) recently presented 
empirical studies of the above clustering approach. These empirical evaluations on a variety 
of real world datasets demonstrate the effectiveness of the proposed approach. 

6. Concluding Remarks 

We presented general explicit PAC-Bayesian bounds for transductive learning. We also 
developed a new prior construction technique which effectively derives tight data-dependent 
error bounds for compression schemes and for transductive learning algorithms based on 
clustering. 

With the exception of Theorem 22, which holds for any bounded loss function, all our 
transductive error bounds were presented within the simplest binary classification setting 
(i.e. with the 0/1-loss function and with non-stochastic labels). However, these results can 
be easily extended to multi-class problems and to stochastic labels. We hope that these 
bounds and the prior construction technique will be useful as a starting point for deriving 
error bounds for other known algorithms and for developing new types of transductive 
learning algorithms. 

We emphasize, however, that in the case of classification (i.e. the 0/1 loss), implicit but 
tighter error bounds for transduction were already known (e.g. Vapnik's result as stated 
in Corollary 9). Our bounds are explicit and can therefore be useful for interpreting and 
characterizing the functional dependency on the problem parameters. 



14. A better approach to combine the k clustering algorithms, especially if we expect that some of the 
algorithms will generate identical clusterings, is to construct one "big" prior for all of them. 
In applications, when using compression schemes or our clustering-based transduction
approach, one can plug in any other PAC-Bayesian transductive bound (implicit or
explicit). For example, one can benefit by using the tighter Vapnik bound. In (El-Yaniv &
Gerzon, 2004) and (Banerjee & Langford, 2004), the authors present empirical studies of
the clustering approach of Section 5.2 by plugging in the Vapnik implicit bound and the
similar implicit bound from (Blum & Langford, 2003), respectively. These empirical studies
indicate that the proposed clustering-based transductive scheme can lead to state-of-the-art
algorithms.

An interesting feature of any transduction error bound for Setting 1 is that it holds
for "individual samples"; that is, the full sample $X_{m+u}$ need not be sampled i.i.d. from
a distribution and moreover, in this setting one cannot assume that it is sampled from a
fixed distribution at all! Therefore, results for this setting must hold for any given sample.
In this sense, the transductive bounds within Setting 1 are considerably more robust than
standard bounds in the inductive setting.

We conclude with some open questions and research directions.

1. All our results are obtained within Vapnik's "Setting 1" of transduction, which must 
consider any arbitrary choice of the full sample. Considering Theorem 2 and Remark 4, 
it would be interesting to see if tighter results are possible within the probabilistic 
Setting 2. 

2. An interesting direction for future research could be the construction of more sophisticated
priors. For example, in our compression bound (Corollary 25), for each number
$s$ of compression points we assigned the same prior to each dichotomy of every
$s$-subset. However, in practice, when there is structure in the data, the vast majority
of all these subsets and dichotomies cannot "explain" the data and should not be
assigned a large prior.

3. The bounds derived in this paper are based on a contribution from the deviation 
of a single hypothesis and a utilization of the union bound in the PAC-Bayesian 
style. More refined approaches, for example those based on McDiarmid's inequality 
(McDiarmid, 1989) and the entropy method (Boucheron, Lugosi, & Massart, 2003), 
are able to eliminate the union bound altogether. It would be interesting to see if 
such approaches can lead to tighter bounds in the current setting. 

4. When considering arbitrary (bounded) loss functions, the basic observation for transduction
(due to Vapnik) that the effective cardinality of the hypothesis class (in the
case of classification) is finite, is not necessarily valid. However, it is likely that one
can still benefit from the availability of the full sample. One possible approach, mentioned
in Remark 7, would be to construct an empirical $\epsilon$-cover of $\mathcal{H}$ based on the $\ell_1$
norm.

5. Finally, we note that the major challenge of determining a precise relation between
the inductive and transductive learning schemes remains open. Of particular interest
would be to determine the relation between the inductive semi-supervised setting
(where the learner is also given a set of unlabeled points, but is required to induce
a hypothesis for the entire space) and transduction. Our bounds suggest that transduction
does not allow for learning rates that are faster than induction (as a function
of $m$). On the other hand, it appears that the bounds we obtain for clustering-based
transduction can be tighter than any known bound for a specific effective inductive
learning algorithm.
Appendix A. Proof of Theorem 2

Proof The proof we present is identical to Vapnik's original proof and is provided for the
sake of self-completeness. Let $A$ be some learning algorithm choosing a hypothesis $h_A \in \mathcal{H}$
based on $S_m \cup X_u$. Define

$$C_A(x_1, y_1; \ldots; x_{m+u}, y_{m+u}) = \left| \frac{1}{m} \sum_{i=1}^{m} \ell(y_i, h_A(x_i)) - \frac{1}{u} \sum_{j=m+1}^{m+u} \ell(y_j, h_A(x_j)) \right| = \left| R_{h_A}(X_m) - R_{h_A}(X_u) \right|.$$

Consider Setting 2. The probability that $C_A$ deviates from zero by an amount greater than
$\epsilon$ is

$$P = \int_{x,y} I(C_A - \epsilon) \, d\mu(x_1, y_1) \cdots d\mu(x_{m+u}, y_{m+u}),$$

where $I$ is an indicator step function, $I(x) = 1$ iff $x > 0$ and $I(x) = 0$ otherwise. Let $T_p$,
$p = 1, \ldots, (m+u)!$ be the permutation operator for the sample $(x_1, y_1); \ldots; (x_{m+u}, y_{m+u})$.
It is not hard to see that

$$P = \int_{x,y} \left\{ \frac{1}{(m+u)!} \sum_{p=1}^{(m+u)!} I\left( C_A(T_p(x_1, y_1; \ldots; x_{m+u}, y_{m+u})) - \epsilon \right) \right\} d\mu(x_1, y_1) \cdots d\mu(x_{m+u}, y_{m+u}).$$

The expression in curly braces is the quantity estimated in Setting 1 and by our assumption,
for any choice of the full sample, it does not exceed $\delta$. Therefore,

$$P \le \int_{x,y} \delta \, d\mu(x_1, y_1) \cdots d\mu(x_{m+u}, y_{m+u}) = \delta.$$